Data Analytics and Machine Learning - Individual Assignment¶
This notebook explores the books dataset obtained by scraping BooksMandala.
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from wordcloud import WordCloud, STOPWORDS
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer
from sentence_transformers import SentenceTransformer
from scipy.sparse import hstack, csr_matrix
import umap
import ast
import re
from tqdm.autonotebook import tqdm, trange
Dataset Loading¶
filepath: str = "/home/am/booksmandala-data-analytics/notebooks/data/dataset.csv"
df = pd.read_csv(filepath)
df.head()
| Title | Author | Price | Rating | Limited Stock | Discount | Genre | Number of Pages | Weight | ISBN | Language | Related Genres | Subgenres | Synopsis | URL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Gruffalo | by Julia Donaldson | Rs. 720 | NaN | Only 3 item left in stock! | NaN | Arts And Photography | 33 Pages | 196g | 9781509804757 | English | Kids and Teens, Arts and Photography, Kids and... | Ages 3 to 5\n, Picture Books\n, Ages 3 to 5, P... | A mouse took a stroll through the deep dark wo... | https://booksmandala.com/books/the-gruffalo-12894 |
| 1 | Tibetan Pilgrimage :Architecture of the Sacred... | by Michel Peisel | Rs. 1200 | NaN | NaN | NaN | Arts And Photography | NaN | 1050g | 9780810959446 | English | Arts and Photography, Miscellaneous, Arts and ... | Architecture\n, Books on Tibet\n, Architecture... | With nearly a hundred exceptional watercolor i... | https://booksmandala.com/books/tibetan-pilgrim... |
| 2 | The Sacred Mountain | by Dalai Lama Xiv Bstan-ʼDzin-Rgya-Mtsho and J... | Rs. 1592 | NaN | NaN | NaN | Arts And Photography | 457 Pages | 970g | 9788120831520 | English | Travel, Arts and Photography, Travel, Arts and... | Climbing and Mountaineering\n, Picture Books\n... | (4) Truth of the path leading to the annihilat... | https://booksmandala.com/books/the-sacred-moun... |
| 3 | The Inner Game of Music | by Barry Green and W. Timothy Gallwey | Rs. 1040 | NaN | Only 6 item left in stock! | NaN | Arts And Photography | 248 Pages | 200g | 9781447291725 | English | Arts and Photography, Self Improvement and Rel... | Music\n, Self Help\n, Psychology\n, Music, Sel... | The bestselling guide to improving musical per... | https://booksmandala.com/books/the-inner-game-... |
| 4 | Hooked: How to Build Habit-Forming Products | by Nir Eyal and Ryan Hoover | Rs. 1118 | NaN | NaN | NaN | Arts And Photography | 242 Pages | 340g | 9780241184837 | English | Business and Investing, Arts and Photography, ... | Business\n, Design\n, Psychology\n, Self Help\... | How do successful companies create products pe... | https://booksmandala.com/books/hooked-how-to-b... |
df.describe()
| Rating | |
|---|---|
| count | 252.000000 |
| mean | 4.440873 |
| std | 0.813651 |
| min | 1.000000 |
| 25% | 4.000000 |
| 50% | 5.000000 |
| 75% | 5.000000 |
| max | 5.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2840 entries, 0 to 2839
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Title            2840 non-null   object
 1   Author           2840 non-null   object
 2   Price            2840 non-null   object
 3   Rating           252 non-null    float64
 4   Limited Stock    1729 non-null   object
 5   Discount         41 non-null     object
 6   Genre            2840 non-null   object
 7   Number of Pages  2640 non-null   object
 8   Weight           2840 non-null   object
 9   ISBN             2840 non-null   object
 10  Language         2840 non-null   object
 11  Related Genres   2840 non-null   object
 12  Subgenres        2655 non-null   object
 13  Synopsis         2835 non-null   object
 14  URL              2840 non-null   object
dtypes: float64(1), object(14)
memory usage: 332.9+ KB
There isn't much numeric data to work with; most columns are strings that need parsing before analysis.
Preprocessing¶
# show data types that are non-numeric
df.select_dtypes("object").columns
Index(['Title', 'Author', 'Price', 'Limited Stock', 'Discount', 'Genre',
'Number of Pages', 'Weight', 'ISBN', 'Language', 'Related Genres',
'Subgenres', 'Synopsis', 'URL'],
dtype='object')
na = df.isna().sum()
na[na > 0]
Rating             2588
Limited Stock      1111
Discount           2799
Number of Pages     200
Subgenres           185
Synopsis              5
dtype: int64
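Seeing the missing values as a percentage of rows makes it easier to decide whether to impute or drop a column. A minimal sketch on a toy frame (standing in for the scraped dataset):

```python
import pandas as pd

# Toy frame standing in for the scraped dataset.
toy = pd.DataFrame({"Rating": [5.0, None, None, 4.0],
                    "Title": ["a", "b", "c", "d"]})

# isna().mean() gives the fraction of missing values per column.
pct_missing = toy.isna().mean() * 100
print(pct_missing[pct_missing > 0])
```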
Price on BooksMandala¶
Books on BooksMandala are often on sale or discounted. When scraping the price, the entire text is extracted, which can include the discounted price, the original price, and the discount percentage.
df["Price"].unique()
array(['Rs. 720', 'Rs. 1200', 'Rs. 1592', 'Rs. 1040', 'Rs. 1118',
'Rs. 1500', 'Rs. 2500', 'Rs. 2238', 'Rs. 1438', 'Rs. 700',
'Rs. 695', 'Rs. 958', 'Rs. 880', 'Rs. 256', 'Rs. 2078', 'Rs. 800',
'Rs. 142', 'Rs. 2560', 'Rs. 4800', 'Rs. 798', 'Rs. 960', 'Rs. 558',
'Rs. 2800', 'Rs. 11360', 'Rs. 640', 'Rs. 1598', 'Rs. 300',
'Rs. 4000', 'Rs. 3000', 'Rs. 478', 'Rs. 638', 'Rs. 398', 'Rs. 96',
'Rs. 4480', 'Rs. 472', 'Rs. 318', 'Rs. 560', 'Rs. 3200',
'Rs. 2000', 'Rs. 366', 'Rs. 200', 'Rs. 288', 'Rs. 2718', 'Rs. 392',
'Rs. 5120', 'Rs. 650', 'Rs. 2160', 'Rs. 1360', 'Rs. 1278',
'Rs. 760', 'Rs. 400', 'Rs. 176', 'Rs. 4649', 'Rs. 238', 'Rs. 125',
'Rs. 1520', 'Rs. 1150', 'Rs. 3417', 'Rs. 2870', 'Rs. 70',
'Rs. 999', 'Rs. 3198', 'Rs. 3358', 'Rs. 2225', 'Rs. 260',
'Rs. 1995', 'Rs. 1917', 'Rs. 705', 'Rs. 1600', 'Rs. 4318',
'Rs. 1918', 'Rs. 1758', 'Rs. 360', 'Rs. 280', 'Rs. 224', 'Rs. 375',
'Rs. 2398', 'Rs. 358.5 Rs. 478( 25% OFF)', 'Rs. 840',
'Rs. 240 Rs. 320( 25% OFF)', 'Rs. 780 Rs. 1040( 25% OFF)',
'Rs. 399', 'Rs. 824', 'Rs. 552', 'Rs. 350', 'Rs. 499', 'Rs. 312',
'Rs. 295', 'Rs. 558.6 Rs. 798( 30% OFF)',
'Rs. 598.5 Rs. 798( 25% OFF)', 'Rs. 670.6 Rs. 958( 30% OFF)',
'Rs. 504 Rs. 720( 30% OFF)', 'Rs. 196 Rs. 280( 30% OFF)',
'Rs. 838.5 Rs. 1118( 25% OFF)', 'Rs. 2470', 'Rs. 455', 'Rs. 95',
'Rs. 875', 'Rs. 1000', 'Rs. 382.8 Rs. 638( 40% OFF)', 'Rs. 6718',
'Rs. 718.5 Rs. 958( 25% OFF)', 'Rs. 3518', 'Rs. 2558',
'Rs. 218.4 Rs. 312( 30% OFF)', 'Rs. 500', 'Rs. 600', 'Rs. 666',
'Rs. 1958', 'Rs. 280 Rs. 400( 30% OFF)',
'Rs. 262.5 Rs. 350( 25% OFF)', 'Rs. 550', 'Rs. 450', 'Rs. 240',
'Rs. 850', 'Rs. 100', 'Rs. 2295', 'Rs. 1425', 'Rs. 150', 'Rs. 250',
'Rs. 1950', 'Rs. 750', 'Rs. 675', 'Rs. 60', 'Rs. 90', 'Rs. 380',
'Rs. 525', 'Rs. 632', 'Rs. 1700', 'Rs. 545', 'Rs. 495', 'Rs. 50',
'Rs. 160', 'Rs. 1750', 'Rs. 456', 'Rs. 507', 'Rs. 480', 'Rs. 645',
'Rs. 190', 'Rs. 680', 'Rs. 1349', 'Rs. 608', 'Rs. 752', 'Rs. 85',
'Rs. 110', 'Rs. 993', 'Rs. 816', 'Rs. 330', 'Rs. 792', 'Rs. 75',
'Rs. 65', 'Rs. 660', 'Rs. 330.4 Rs. 472( 30% OFF)', 'Rs. 998',
'Rs. 952', 'Rs. 580', 'Rs. 360 Rs. 480( 25% OFF)',
'Rs. 105 Rs. 150( 30% OFF)', 'Rs. 158', 'Rs. 230', 'Rs. 78',
'Rs. 130', 'Rs. 336', 'Rs. 112', 'Rs. 298', 'Rs. 275', 'Rs. 175',
'Rs. 320', 'Rs. 195', 'Rs. 395', 'Rs. 325', 'Rs. 770', 'Rs. 25598',
'Rs. 145', 'Rs. 3192', 'Rs. 140', 'Rs. 299', 'Rs. 1440', 'Rs. 105',
'Rs. 490', 'Rs. 152', 'Rs. 115', 'Rs. 1112', 'Rs. 3652', 'Rs. 995',
'Rs. 530', 'Rs. 704', 'Rs. 180 Rs. 240( 25% OFF)', 'Rs. 832',
'Rs. 795', 'Rs. 72', 'Rs. 128', 'Rs. 225', 'Rs. 290', 'Rs. 2878',
'Rs. 3038', 'Rs. 1344', 'Rs. 1760', 'Rs. 672', 'Rs. 1272',
'Rs. 448', 'Rs. 1116', 'Rs. 446.6 Rs. 638( 30% OFF)', 'Rs. 1120',
'Rs. 2240', 'Rs. 1568', 'Rs. 440', 'Rs. 1232', 'Rs. 896',
'Rs. 1840', 'Rs. 950', 'Rs. 1680', 'Rs. 80', 'Rs. 22398',
'Rs. 1250', 'Rs. 2320', 'Rs. 6398', 'Rs. 3280', 'Rs. 2640',
'Rs. 19198', 'Rs. 6800', 'Rs. 304', 'Rs. 1960', 'Rs. 2880',
'Rs. 625', 'Rs. 2544', 'Rs. 475', 'Rs. 4792', 'Rs. 120', 'Rs. 624',
'Rs. 520', 'Rs. 104', 'Rs. 432', 'Rs. 170', 'Rs. 384', 'Rs. 374',
'Rs. 768', 'Rs. 1008 Rs. 1440( 30% OFF)', 'Rs. 496', 'Rs. 216',
'Rs. 1275', 'Rs. 1080', 'Rs. 1800', 'Rs. 592', 'Rs. 3400',
'Rs. 1280', 'Rs. 1300', 'Rs. 340', 'Rs. 1007', 'Rs. 64', 'Rs. 787',
'Rs. 1115', 'Rs. 1595', 'Rs. 900', 'Rs. 486', 'Rs. 1584',
'Rs. 2072', 'Rs. 1920', 'Rs. 2726', 'Rs. 790', 'Rs. 944',
'Rs. 598', 'Rs. 333', 'Rs. 555', 'Rs. 425', 'Rs. 498', 'Rs. 576',
'Rs. 599', 'Rs. 595', 'Rs. 220', 'Rs. 775', 'Rs. 548', 'Rs. 575',
'Rs. 348', 'Rs. 265', 'Rs. 698', 'Rs. 699', 'Rs. 458', 'Rs. 777',
'Rs. 648', 'Rs. 748', 'Rs. 445', 'Rs. 485', 'Rs. 1904', 'Rs. 688',
'Rs. 1142', 'Rs. 2549', 'Rs. 14260', 'Rs. 2224', 'Rs. 9792',
'Rs. 1274', 'Rs. 6000', 'Rs. 1825', 'Rs. 1277', 'Rs. 2100',
'Rs. 2400', 'Rs. 3600', 'Rs. 2545', 'Rs. 3998', 'Rs. 928',
'Rs. 9918', 'Rs. 1593', 'Rs. 600 Rs. 800( 25% OFF)',
'Rs. 489.3 Rs. 699( 30% OFF)', 'Rs. 206', 'Rs. 1208', 'Rs. 4134',
'Rs. 5278', 'Rs. 2480', 'Rs. 2141', 'Rs. 5118', 'Rs. 4024',
'Rs. 1180', 'Rs. 3500', 'Rs. 3388', 'Rs. 477 Rs. 795( 40% OFF)',
'Rs. 3437', 'Rs. 1429', 'Rs. 2514', 'Rs. 2250', 'Rs. 742',
'Rs. 1550', 'Rs. 590', 'Rs. 1939'], dtype=object)
The Price column therefore contains values like Rs. 358.5 Rs. 478( 25% OFF). We can use a regular expression to extract the actual price(s) of the book.
def extract_price(price: str) -> dict:
matches = re.findall(r'Rs\.\s+(\d+(\.\d+)?)', price)
if matches:
if len(matches) == 1:
# only one price, treat it as original price
list_price = float(matches[0][0])
return {
'discounted_price': 0, # No discounted price
'list_price': list_price
}
elif len(matches) >= 2:
# Two prices, treat the first as discounted and second as original
discounted_price = float(matches[0][0])
list_price = float(matches[1][0])
return {
'discounted_price': discounted_price,
'list_price': list_price
}
return {'discounted_price': None, 'list_price': None}
print(extract_price("Rs. 358.5 Rs. 478( 25% OFF)"))
print(extract_price("Rs. 450"))
{'discounted_price': 358.5, 'list_price': 478.0}
{'discounted_price': 0, 'list_price': 450.0}
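As an aside, the same extraction could be done in a vectorized way with `Series.str.extract`; this is an alternative sketch on toy values, not what the notebook uses:

```python
import pandas as pd

raw_prices = pd.Series(["Rs. 358.5 Rs. 478( 25% OFF)", "Rs. 450"])

# Capture the first price and, optionally, a second one on the same line.
extracted = raw_prices.str.extract(
    r"Rs\.\s+(\d+(?:\.\d+)?)(?:\s+Rs\.\s+(\d+(?:\.\d+)?))?").astype(float)

# When a second price exists, the first is discounted and the second is the list price.
list_price = extracted[1].fillna(extracted[0])
discounted = extracted[0].where(extracted[1].notna(), 0.0)
print(list_price.tolist())  # [478.0, 450.0]
print(discounted.tolist())  # [358.5, 0.0]
```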
prices = df["Price"].apply(extract_price)
df["Price"] = prices.apply(lambda x: x["discounted_price"]
if x["discounted_price"] != 0 else x["list_price"])
df["List Price"] = prices.apply(lambda x: x["list_price"])
df["Discount Amount"] = prices.apply(
lambda x: 0 if x["discounted_price"] == 0 else x["list_price"] - x["discounted_price"])
df[["Price", "List Price", "Discount Amount"]]
| Price | List Price | Discount Amount | |
|---|---|---|---|
| 0 | 720.0 | 720.0 | 0.0 |
| 1 | 1200.0 | 1200.0 | 0.0 |
| 2 | 1592.0 | 1592.0 | 0.0 |
| 3 | 1040.0 | 1040.0 | 0.0 |
| 4 | 1118.0 | 1118.0 | 0.0 |
| ... | ... | ... | ... |
| 2835 | 500.0 | 500.0 | 0.0 |
| 2836 | 798.0 | 798.0 | 0.0 |
| 2837 | 632.0 | 632.0 | 0.0 |
| 2838 | 880.0 | 880.0 | 0.0 |
| 2839 | 560.0 | 560.0 | 0.0 |
2840 rows × 3 columns
print(df["Number of Pages"])
print("Null count: ", df["Number of Pages"].isna().sum())
0 33 Pages
1 NaN
2 457 Pages
3 248 Pages
4 242 Pages
...
2835 336 Pages
2836 112 Pages
2837 266 Pages
2838 312 Pages
2839 189 Pages
Name: Number of Pages, Length: 2840, dtype: object
Null count: 200
There aren't any values with decimals, so we can simply delete every non-digit character (\D) using RegEx
df["Number of Pages"] = df["Number of Pages"][~df["Number of Pages"].isna()] \
.str.replace(r"\D", "", regex=True).astype("int")
df["Number of Pages"]
0 33.0
1 NaN
2 457.0
3 248.0
4 242.0
...
2835 336.0
2836 112.0
2837 266.0
2838 312.0
2839 189.0
Name: Number of Pages, Length: 2840, dtype: float64
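An equivalent approach, assuming the same "N Pages" format, is `pd.to_numeric` with `errors="coerce"`, which lets nulls pass through without masking them out first (a sketch on toy values):

```python
import pandas as pd

pages = pd.Series(["33 Pages", None, "457 Pages"])

# Strip non-digits, then coerce; NaN stays NaN and the dtype becomes float64.
cleaned = pd.to_numeric(pages.str.replace(r"\D", "", regex=True),
                        errors="coerce")
print(cleaned.tolist())
```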
df["Weight"].unique()
array(['196g', '1050g', '970g', '200g', '340g', '515g', '2400g', '370g',
'1200g', '290g', '585g', '250g', '560g', '300g', '344g', '1260g',
'525g', '86g', '2160g', '1550g', '490g', '640g', '339g', '520g',
'180g', '550g', '260g', '30g', '700g', '675g', '1730g', '2960g',
'335g', '415g', '1140g', '790g', '170g', '1320g', '1500g', '530g',
'150g', '310g', '569g', '130g', '100g', '1400g', '1160g', '80g',
'800g', '360g', '825g', '500g', '600g', '280g', '140g', '110g',
'135g', '910g', '1375g', '275g', '660g', '642g', '210g', '206g',
'450g', '75g', '1340g', '161g', '940g', '380g', '460g', '477g',
'220g', '197g', '615g', '565g', '315g', '690g', '1323g', '225g',
'70g', '230g', '2475g', '555g', '830g', '1640g', '195g', '215g',
'390g', '305g', '960g', '650g', '540g', '720g', '715g', '405g',
'1860g', '915g', '725g', '843g', '175g', '365g', '249g', '625g',
'235g', '245g', '134g', '325g', '505g', '410g', '285g', '648g',
'425g', '400g', '181g', '350g', '454g', '320g', '545g', '440g',
'190g', '375g', '1180g', '136g', '160g', '205g', '470g', '145g',
'865g', '240g', '465g', '192g', '345g', '219g', '730g', '270g',
'420g', '480g', '330g', '355g', '395g', '295g', '95g', '265g',
'512g', '820g', '635g', '735g', '90g', '430g', '165g', '255g',
'269g', '185g', '155g', '670g', '337g', '294g', '166g', '232g',
'189g', '277g', '72g', '348g', '301g', '2800g', '1425g', '870g',
'385g', '475g', '252g', '575g', '91g', '710g', '1075g', '630g',
'1680g', '935g', '177g', '378g', '203g', '2000g', '535g', '610g',
'510g', '216g', '590g', '317g', '120g', '99g', '142g', '495g',
'1030g', '580g', '164g', '169g', '605g', '1100g', '422g', '807g',
'336g', '890g', '187g', '85g', '34g', '188g', '125g', '105g',
'146g', '7900g', '122g', '920g', '147g', '810g', '162g', '620g',
'65g', '60g', '228g', '876g', '50g', '455g', '238g', '123g',
'323g', '93g', '248g', '894g', '213g', '1130g', '765g', '261g',
'514g', '850g', '322g', '257g', '94g', '595g', '491g', '1790g',
'115g', '264g', '1175g', '506g', '1010g', '1300g', '1810g', '944g',
'174g', '771g', '885g', '805g', '382g', '1440g', '204g', '272g',
'1430g', '1275g', '570g', '1073g', '312g', '1335g', '127g', '657g',
'435g', '372g', '246g', '499g', '1250g', '746g', '4940g', '211g',
'369g', '191g', '397g', '376g', '4000g', '222g', '1350g', '900g',
'55g', '566g', '445g', '131g', '945g', '1190g', '1090g', '1080g',
'760g', '2050g', '318g', '8000g', '263g', '1082g', '352g', '242g',
'548g', '167g', '485g', '1070g', '1465g', '1485g', '780g', '1185g',
'680g', '1040g', '346g', '1460g', '214g', '1490g', '293g', '227g',
'579g', '52g', '62g', '750g', '401g', '386g', '429g', '1330g',
'840g', '1000g', '3000g', '1150g', '1299g', '975g', '1g', '2425g',
'2150g', '1055g', '1950g', '1820g', '1620g', '880g', '860g',
'1760g', '770g', '1420g', '1530g', '1870g', '432g', '1560g',
'1590g', '786g', '785g', '159g', '132g', '999g', '1165g', '226g',
'816g', '2250g', '2930g', '67g', '144g', '243g', '685g', '645g',
'439g', '463g', '302g', '296g', '407g', '359g', '740g', '202g',
'459g', '1060g', '736g', '995g', '1280g', '835g', '286g', '695g'],
dtype=object)
df["Weight"] = df["Weight"].str.replace("g", "", regex=True).astype("int")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2840 entries, 0 to 2839
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Title            2840 non-null   object
 1   Author           2840 non-null   object
 2   Price            2840 non-null   float64
 3   Rating           252 non-null    float64
 4   Limited Stock    1729 non-null   object
 5   Discount         41 non-null     object
 6   Genre            2840 non-null   object
 7   Number of Pages  2640 non-null   float64
 8   Weight           2840 non-null   int64
 9   ISBN             2840 non-null   object
 10  Language         2840 non-null   object
 11  Related Genres   2840 non-null   object
 12  Subgenres        2655 non-null   object
 13  Synopsis         2835 non-null   object
 14  URL              2840 non-null   object
 15  List Price       2840 non-null   float64
 16  Discount Amount  2840 non-null   float64
dtypes: float64(5), int64(1), object(11)
memory usage: 377.3+ KB
df["Discount"].value_counts()
Discount
( 25% OFF)    19
( 30% OFF)    18
( 40% OFF)     4
Name: count, dtype: int64
df["Discount"] = df["Discount"].str.replace(
    r"\D", "", regex=True).astype("float")
df.fillna({"Discount": 0}, inplace=True)
df["Discount"] = df["Discount"] / 100
df["Discount"].value_counts()
Discount
0.00    2799
0.25      19
0.30      18
0.40       4
Name: count, dtype: int64
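For reference, the percentage could also be pulled out directly with `str.extract`, avoiding the separate replace-and-divide steps; a sketch on toy values of the same shape:

```python
import pandas as pd

discount = pd.Series(["( 25% OFF)", None, "( 40% OFF)"])

# Extract the digits before the % sign, convert to a fraction, default to 0.
fraction = discount.str.extract(r"(\d+)%")[0].astype(float).div(100).fillna(0)
print(fraction.tolist())
```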
Limited Stocks¶
Limited Stock text is only extracted when the stock is actually limited, so it is safe to make this column binary.
df["Limited Stock"]
0 Only 3 item left in stock!
1 NaN
2 NaN
3 Only 6 item left in stock!
4 NaN
...
2835 NaN
2836 Only 4 item left in stock!
2837 Only 5 item left in stock!
2838 NaN
2839 NaN
Name: Limited Stock, Length: 2840, dtype: object
# notna() is True exactly where a "limited stock" message was scraped
df["Limited Stock"] = df["Limited Stock"].notna()
df["Limited Stock"].value_counts()
Limited Stock
True     1729
False    1111
Name: count, dtype: int64
Author¶
df["Author"]
0 by Julia Donaldson
1 by Michel Peisel
2 by Dalai Lama Xiv Bstan-ʼDzin-Rgya-Mtsho and J...
3 by Barry Green and W. Timothy Gallwey
4 by Nir Eyal and Ryan Hoover
...
2835 by Bishnu Raj Upreti
2836 by Jack Kerouac
2837 by John Wood
2838 by Peter Matthiessen
2839 by Amar. Bhushan
Name: Author, Length: 2840, dtype: object
# strip only the leading "by "; a bare replace("by", "") would also mangle
# author names that happen to contain "by"
df["Author"] = df["Author"].str.replace(r"^by\s+", "", regex=True)
df["Author"] = df["Author"].str.replace(" and ", ", ")
df["Author"] = df["Author"].str.strip()
df["Author"]
0 Julia Donaldson
1 Michel Peisel
2 Dalai Lama Xiv Bstan-ʼDzin-Rgya-Mtsho, John Sn...
3 Barry Green, W. Timothy Gallwey
4 Nir Eyal, Ryan Hoover
...
2835 Bishnu Raj Upreti
2836 Jack Kerouac
2837 John Wood
2838 Peter Matthiessen
2839 Amar. Bhushan
Name: Author, Length: 2840, dtype: object
Genres on BooksMandala¶
Preprocessing the Related Genres column¶
A single book isn't limited to one genre.
The "Related Genres" section on a book's BooksMandala webpage lists all the genres the book belongs to.
df["Related Genres"].value_counts()
Related Genres
Foreign Languages, Foreign Languages 122
Nepali, Nepali 105
Miscellaneous\n, Miscellaneous 102
Arts and Photography, Arts and Photography 52
Kids and Teens, Kids and Teens 51
...
Spirituality and Philosophy\n, Nature\n, Spirituality and Philosophy, Nature 1
Spirituality and Philosophy, History, Biography, and Social Science, Nature, Spirituality and Philosophy, History, Biography, and Social Science, Nature 1
Nature, Nepali, Nature, Nepali 1
Fiction and Literature, Fiction and Literature, Fiction and Literature, Nature, Fiction and Literature, Fiction and Literature, Fiction and Literature, Fiction and Literature 1
Travel, History, Biography, and Social Science, Travel, History, Biography, and Social Science 1
Name: count, Length: 613, dtype: int64
df["Genre"].unique()
array(['Arts And Photography', 'Business And Investing',
'Fiction And Literature', 'Foreign Languages',
'History Biography And Social Science', 'Kids And Teens',
'Learning And Reference', 'Lifestyle And Wellness',
'Manga And Graphic Novels', 'Miscellaneous', 'Nature', 'Nepali',
'Political Science', 'Rare Coffee Table Books', 'Religion',
'Self Improvement And Relationships',
'Spirituality And Philosophy', 'Technology', 'Travel'],
dtype=object)
def preprocess_related_genres(related_genres: str) -> list[str]:
unique_genres = list(df["Genre"].unique())
genres = re.sub(r"History, Biography, and Social Science",
"History Biography And Social Science", related_genres).strip()
genres = [genre.strip().title() for genre in genres.split(",")]
extracted_genres = []
for genre in genres:
if genre in unique_genres and genre not in extracted_genres:
extracted_genres.append(genre)
return extracted_genres
print(df["Related Genres"][5])
print(preprocess_related_genres(df["Related Genres"][5]))
print(df["Related Genres"][1845])
print(preprocess_related_genres(df["Related Genres"][1845]))
Arts and Photography, Nepali, Arts and Photography, Nepali
['Arts And Photography', 'Nepali']
History, Biography, and Social Science, Nepali, History, Biography, and Social Science, Nepali
['History Biography And Social Science', 'Nepali']
df["Related Genres"] = df["Related Genres"].apply(preprocess_related_genres)
df[["Genre", "Related Genres"]]
| Genre | Related Genres | |
|---|---|---|
| 0 | Arts And Photography | [Kids And Teens, Arts And Photography] |
| 1 | Arts And Photography | [Arts And Photography, Miscellaneous] |
| 2 | Arts And Photography | [Travel, Arts And Photography] |
| 3 | Arts And Photography | [Arts And Photography, Self Improvement And Re... |
| 4 | Arts And Photography | [Business And Investing, Arts And Photography,... |
| ... | ... | ... |
| 2835 | Travel | [Travel, Nepali] |
| 2836 | Travel | [Fiction And Literature] |
| 2837 | Travel | [History Biography And Social Science, Busines... |
| 2838 | Travel | [History Biography And Social Science] |
| 2839 | Travel | [Nepali, History Biography And Social Science] |
2840 rows × 2 columns
df["Related Genres"].value_counts()
Related Genres
[Nepali] 139
[Fiction And Literature] 138
[Foreign Languages] 122
[Miscellaneous] 119
[Kids And Teens] 103
...
[Nature, Nepali] 1
[Spirituality And Philosophy, History Biography And Social Science, Nature] 1
[Spirituality And Philosophy, Nature] 1
[Self Improvement And Relationships, Spirituality And Philosophy, Arts And Photography] 1
[Fiction And Literature, Learning And Reference, History Biography And Social Science] 1
Name: count, Length: 321, dtype: int64
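With Related Genres now a list-valued column, `MultiLabelBinarizer` (imported at the top of the notebook) can later turn it into binary genre indicators for clustering. A small sketch on toy genre lists:

```python
from sklearn.preprocessing import MultiLabelBinarizer

genres = [
    ["Kids And Teens", "Arts And Photography"],
    ["Travel"],
    ["Travel", "Nepali"],
]

# One row per book, one binary column per genre (columns sorted alphabetically).
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(genres)
print(mlb.classes_)
print(encoded)
```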
Sub-genres¶
def preprocess_subgenres(genre_string: str) -> list[str]:
genre_string_list = re.sub(r"\n", "", genre_string).strip().split(",")
subgenres = [subgenre.strip() for subgenre in genre_string_list]
extracted = []
for subgenre in subgenres:
if subgenre not in extracted:
extracted.append(subgenre)
return extracted
print(preprocess_subgenres(df["Subgenres"][5]))
print(preprocess_subgenres(df["Subgenres"][1455]))
['Picture Books', 'Books on Nepal']
['Books on India', 'Politics', 'History']
df["Subgenres"] = df["Subgenres"][~df["Subgenres"].isna()] \
.apply(preprocess_subgenres)
Genre Accuracy¶
Every book on BooksMandala has multiple genres. Since the data was scraped genre by genre, the Genre column may not represent the core genre of a book. A book's individual page, however, has a "Related Genres" section that lists all of its genres and comes closer to describing its core genre.
Take the book Big Magic by Elizabeth Gilbert, for example. Since the data was extracted by browsing the genre categories (like this) on the BooksMandala website, Big Magic was captured under Arts and Photography, which isn't false. However, most websites, such as Goodreads (a book review and recommendation site), put Big Magic under Self Help.
df[df["Title"] == "Big Magic"]
| Title | Author | Price | Rating | Limited Stock | Discount | Genre | Number of Pages | Weight | ISBN | Language | Related Genres | Subgenres | Synopsis | URL | List Price | Discount Amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 155 | Big Magic | Elizabeth Gilbert | 798.0 | NaN | True | 0.0 | Arts And Photography | 271.0 | 215 | 9781408886182 | English | [Self Improvement And Relationships, Arts And ... | [Self Help, Art, Psychology, Memoir] | Readers of all ages and walks of life have dra... | https://booksmandala.com/books/big-magic-15312 | 798.0 | 0.0 |
# fall back to the row's own Genre when Related Genres is empty
# (referencing `df["Genre"]` inside the lambda would return the whole Series)
df["Genre"] = df.apply(
    lambda row: row["Related Genres"][0] if row["Related Genres"] else row["Genre"],
    axis=1)
df[df["Title"] == "Big Magic"]
| Title | Author | Price | Rating | Limited Stock | Discount | Genre | Number of Pages | Weight | ISBN | Language | Related Genres | Subgenres | Synopsis | URL | List Price | Discount Amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 155 | Big Magic | Elizabeth Gilbert | 798.0 | NaN | True | 0.0 | Self Improvement And Relationships | 271.0 | 215 | 9781408886182 | English | [Self Improvement And Relationships, Arts And ... | [Self Help, Art, Psychology, Memoir] | Readers of all ages and walks of life have dra... | https://booksmandala.com/books/big-magic-15312 | 798.0 | 0.0 |
Here, by taking the first genre listed on the book's own webpage, we get a closer categorization of the book's genre.
Duplicates¶
df.drop_duplicates(subset=['Title', 'ISBN'], keep='last', inplace=True)
Null values¶
na = df.isna().sum()
na[na > 0]
Rating             2087
Number of Pages     188
Synopsis              5
dtype: int64
(df["Number of Pages"].isna().sum() / df.shape[0]) * 100
8.355555555555554
(df["Rating"].isna().sum() / df.shape[0]) * 100
92.75555555555556
df.drop("Rating", axis=1, inplace=True)
Synopsis¶
def clean_synopses(text: str) -> str | None:
default_patterns = [
r"A description of this book has not been provided",
r"is available for purchase at Books Mandala",
r"No description available",
r"^\s*\d+\s*$" # detects only numbers and whitespaces
]
for pattern in default_patterns:
if re.search(pattern, str(text), re.IGNORECASE):
return None
return text
df["Synopsis"] = df["Synopsis"].apply(clean_synopses)
Instead of dropping the null values, I can try to fetch the missing synopses from the Google Books API.
import requests
from dotenv import load_dotenv
import os
load_dotenv()
def get_synopsis(isbn: str) -> str | None:
    # requires your own Google Books API key (API_KEY in .env)
    response = requests.get(
        f"https://www.googleapis.com/books/v1/volumes?q=isbn:{isbn}&key={os.getenv('API_KEY')}")
    data = response.json()
    if "items" not in data or len(data["items"]) == 0:
        return None
    self_link = data["items"][0].get("selfLink")
    if not self_link:  # passing a non-URL default to requests.get would raise
        return None
    volume_info = requests.get(self_link).json().get("volumeInfo", {})
    return volume_info.get("description", None)
def fill_synopsis(row):
if not pd.isna(row["Synopsis"]):
return row["Synopsis"]
return get_synopsis(row["ISBN"])
df[df["Synopsis"].isna()]["ISBN"]
11 9789993347972
15 BM35556B95F6BE
27 9789386671769
28 9789937050098
37 9788177696479
...
2776 9781838952730
2778 BM13596E34E338
2812 9789815204681
2817 BMD38A20C2904F
2839 9789353570132
Name: ISBN, Length: 321, dtype: object
df['Synopsis'] = df.apply(fill_synopsis, axis=1)
df["Synopsis"].isna().sum()
308
df[df["Synopsis"].isna()]
| Title | Author | Price | Limited Stock | Discount | Genre | Number of Pages | Weight | ISBN | Language | Related Genres | Subgenres | Synopsis | URL | List Price | Discount Amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | pokhara y el annapurna | Dinesh. Shrestha | 695.0 | False | 0.0 | Arts And Photography | NaN | 250 | 9789993347972 | English | [Arts And Photography] | [Photography and Filmmaking] | None | https://booksmandala.com/books/pokhara-y-el-an... | 695.0 | 0.0 |
| 15 | Tibetan Children's Colouring Book | Unknown | 256.0 | False | 0.0 | Kids And Teens | 16.0 | 344 | BM35556B95F6BE | English | [Kids And Teens, Arts And Photography] | [Coloring for Children, Coloring Books] | None | https://booksmandala.com/books/tibetan-childre... | 256.0 | 0.0 |
| 27 | Solimo Copy Colour Pack, Set of 6 Books | Unassigned | 960.0 | False | 0.0 | Arts And Photography | NaN | 550 | 9789386671769 | English | [Arts And Photography] | [Coloring Books] | None | https://booksmandala.com/books/solimo-copy-col... | 960.0 | 0.0 |
| 28 | Color Nepal coloring book for all | Laibari | 700.0 | True | 0.0 | Arts And Photography | NaN | 260 | 9789937050098 | English | [Arts And Photography] | [Coloring Books] | None | https://booksmandala.com/books/color-nepal-col... | 700.0 | 0.0 |
| 37 | Mandala Colouring Book | Unassigned | 300.0 | True | 0.0 | Arts And Photography | NaN | 250 | 9788177696479 | English | [Arts And Photography] | [Art] | None | https://booksmandala.com/books/mandala-colouri... | 300.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2774 | The World Pocket Atlas | Unassigned | 742.0 | True | 0.0 | Travel | NaN | 250 | 9788182525160 | English | [Travel] | [Atlas] | None | https://booksmandala.com/books/the-world-pocke... | 742.0 | 0.0 |
| 2776 | Everest 1922 | Mick Conefrey | 798.0 | True | 0.0 | Travel | 310.0 | 245 | 9781838952730 | English | [Travel] | [Climbing and Mountaineering] | None | https://booksmandala.com/books/everest-1922-49955 | 798.0 | 0.0 |
| 2778 | Destination Nepal | Rabindra Dhoju | 110.0 | False | 0.0 | Travel | 24.0 | 60 | BM13596E34E338 | English | [Travel] | [Travel Guide Books] | None | https://booksmandala.com/books/destination-nep... | 110.0 | 0.0 |
| 2817 | The Pokhara Valley | Rabinthara Dhoju | 110.0 | False | 0.0 | Travel | 24.0 | 100 | BMD38A20C2904F | English | [Travel] | [Travel Guide Books] | None | https://booksmandala.com/books/the-pokhara-val... | 110.0 | 0.0 |
| 2839 | INSIDE NEPAL/THE WALK-IN. | Amar. Bhushan | 560.0 | False | 0.0 | Nepali | 189.0 | 175 | 9789353570132 | English | [Nepali, History Biography And Social Science] | [Books on Nepal, History] | None | https://booksmandala.com/books/inside-nepalthe... | 560.0 | 0.0 |
308 rows × 16 columns
df.dropna(subset=["Synopsis"], inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1942 entries, 3 to 2838
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Title            1942 non-null   object
 1   Author           1942 non-null   object
 2   Price            1942 non-null   float64
 3   Limited Stock    1942 non-null   bool
 4   Discount         1942 non-null   float64
 5   Genre            1942 non-null   object
 6   Number of Pages  1888 non-null   float64
 7   Weight           1942 non-null   int64
 8   ISBN             1942 non-null   object
 9   Language         1942 non-null   object
 10  Related Genres   1942 non-null   object
 11  Subgenres        1942 non-null   object
 12  Synopsis         1942 non-null   object
 13  URL              1942 non-null   object
 14  List Price       1942 non-null   float64
 15  Discount Amount  1942 non-null   float64
dtypes: bool(1), float64(5), int64(1), object(9)
memory usage: 244.6+ KB
Number of Pages¶
Visualizing the distribution of Number of Pages before imputing the missing values
fig = px.box(df,
x="Number of Pages",
title="Distribution of Number of Pages")
fig.update_layout(bargap=0.1)
fig.show()
df.fillna({"Number of Pages": df["Number of Pages"].median()}, inplace=True)
fig = px.box(df,
x="Number of Pages",
title="Distribution of Number of Pages")
fig.update_layout(bargap=0.1)
fig.show()
Export cleaned data¶
df = df[['Title', 'Author', 'Price', 'List Price',
'Discount Amount', 'Limited Stock', 'Discount',
'Genre', 'Number of Pages', 'Weight', 'ISBN', 'Language',
'Related Genres', 'Subgenres', 'Synopsis', 'URL']].copy()
# `columns=` is required; a bare mapping would try to rename index labels
df.rename(columns={"Weight": "Weight(grams)"}, inplace=True)
df.to_csv("data/dataset_cleaned.csv", index=False)
Import cleaned data¶
df = pd.read_csv(
"/home/am/booksmandala-data-analytics/notebooks/data/dataset_cleaned.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1942 entries, 0 to 1941
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Title            1942 non-null   object
 1   Author           1939 non-null   object
 2   Price            1942 non-null   float64
 3   List Price       1942 non-null   float64
 4   Discount Amount  1942 non-null   float64
 5   Limited Stock    1942 non-null   bool
 6   Discount         1942 non-null   float64
 7   Genre            1942 non-null   object
 8   Number of Pages  1942 non-null   float64
 9   Weight           1942 non-null   int64
 10  ISBN             1942 non-null   object
 11  Language         1942 non-null   object
 12  Related Genres   1942 non-null   object
 13  Subgenres        1942 non-null   object
 14  Synopsis         1942 non-null   object
 15  URL              1942 non-null   object
dtypes: bool(1), float64(5), int64(1), object(9)
memory usage: 229.6+ KB
Clean and convert¶
df["Author"].isna().sum()
3
df.dropna(subset=["Author"], inplace=True)
def clean_and_convert_to_list(value):
if isinstance(value, str):
value = value.strip()
if value.startswith('[') and value.endswith(']'):
try:
return ast.literal_eval(value)
except (ValueError, SyntaxError):
return []
elif isinstance(value, list):
return value
return []
df['Related Genres'] = df['Related Genres'].apply(clean_and_convert_to_list)
df['Subgenres'] = df['Subgenres'].apply(clean_and_convert_to_list)
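The reason this parsing step is needed: pandas writes list-valued columns to CSV as their string representation, so they come back as plain strings. A minimal round-trip sketch:

```python
import ast

# A list column written to CSV comes back as its string repr...
stored = str(["Travel", "Nepali"])   # "['Travel', 'Nepali']"

# ...which ast.literal_eval safely parses back into a Python list.
restored = ast.literal_eval(stored)
print(restored)
```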
Exploratory Analysis and Visualizations¶
df.describe()
| Price | List Price | Discount Amount | Discount | Number of Pages | Weight | |
|---|---|---|---|---|---|---|
| count | 1939.000000 | 1939.000000 | 1939.000000 | 1939.000000 | 1939.000000 | 1939.000000 |
| mean | 1034.083703 | 1037.166065 | 3.082362 | 0.004513 | 286.327488 | 385.328520 |
| std | 1217.404008 | 1216.543176 | 26.548347 | 0.035758 | 195.250736 | 421.485496 |
| min | 60.000000 | 60.000000 | 0.000000 | 0.000000 | 10.000000 | 1.000000 |
| 25% | 560.000000 | 560.000000 | 0.000000 | 0.000000 | 192.000000 | 210.000000 |
| 50% | 800.000000 | 800.000000 | 0.000000 | 0.000000 | 260.000000 | 285.000000 |
| 75% | 1118.000000 | 1118.000000 | 0.000000 | 0.000000 | 347.000000 | 412.500000 |
| max | 25598.000000 | 25598.000000 | 432.000000 | 0.400000 | 3766.000000 | 8000.000000 |
avg_data = df.groupby('Genre').agg(
{
'Price': 'mean',
'Number of Pages': 'mean',
'ISBN': 'count'
}
).reset_index()
avg_data.columns = ['Genre', 'Average Price',
'Average Page Count', 'Number of Books']
avg_data
| Genre | Average Price | Average Page Count | Number of Books | |
|---|---|---|---|---|
| 0 | Arts And Photography | 1983.400000 | 191.708333 | 120 |
| 1 | Business And Investing | 1045.455189 | 313.669811 | 212 |
| 2 | Fiction And Literature | 896.280398 | 320.732955 | 352 |
| 3 | Foreign Languages | 640.125000 | 304.281250 | 32 |
| 4 | History Biography And Social Science | 1090.775494 | 348.316206 | 253 |
| 5 | Kids And Teens | 875.896000 | 131.320000 | 125 |
| 6 | Learning And Reference | 1138.465517 | 368.965517 | 58 |
| 7 | Lifestyle And Wellness | 1028.132353 | 304.911765 | 68 |
| 8 | Manga And Graphic Novels | 2138.722892 | 317.277108 | 83 |
| 9 | Miscellaneous | 615.397727 | 285.034091 | 88 |
| 10 | Nature | 996.758621 | 252.913793 | 58 |
| 11 | Nepali | 594.190083 | 252.991736 | 121 |
| 12 | Rare Coffee Table Books | 2233.500000 | 86.000000 | 2 |
| 13 | Religion | 778.200000 | 267.416667 | 60 |
| 14 | Self Improvement And Relationships | 846.029697 | 264.133333 | 165 |
| 15 | Spirituality And Philosophy | 775.662338 | 275.220779 | 77 |
| 16 | Technology | 1202.320000 | 267.720000 | 25 |
| 17 | Travel | 1153.900000 | 299.900000 | 40 |
Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3 - Q1
# keep values within 1.5 * IQR of Q1 and Q3
no_outlier_df = df[(df['Price'] >= (Q1 - 1.5 * IQR)) & (df['Price'] <= (Q3 + 1.5 * IQR))]
print("Before removing outliers: ", df.shape)
print("After removing outliers:", no_outlier_df.shape)
Before removing outliers:  (1939, 16)
After removing outliers: (1789, 16)
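For intuition, the same 1.5 × IQR rule on a toy series (the values are made up):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])  # 100 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
kept = s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]
print(kept.tolist())  # [1, 2, 3, 4, 5]
```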
Histograms¶
fig = px.histogram(df,
x="Price",
marginal="box",
title="Distribution of Book Prices")
fig.update_layout(bargap=0.1)
fig.show()
fig = px.histogram(no_outlier_df,
x="Price",
marginal="box",
title="Distribution of Book Prices (after removing outliers)")
fig.update_layout(bargap=0.1)
fig.show()
fig = px.histogram(df,
x="Weight",
marginal="box",
title="Distribution of Book Weight")
fig.update_layout(bargap=0.1)
fig.show()
fig = px.histogram(avg_data,
y='Average Price',
x='Genre',
title="Average Price by Genre",
log_y=True)
fig.add_trace(go.Scatter(x=avg_data['Genre'],
y=avg_data['Average Price'],
mode='lines',
name='Average Price Trend',
line=dict(color='DarkSlateGrey', width=1)))
fig.update_layout(height=500, showlegend=False, bargap=0.1)
fig.show()
Bar charts¶
fig = px.bar(df['Genre'].value_counts().reset_index(),
x='Genre',
y='count',
title='Number of Books by Genre')
fig.update_layout(height=500, bargap=0.1)
fig.show()
fig = px.bar(df['Related Genres'].explode().value_counts().reset_index(),
x='Related Genres',
y='count',
title='Number of Books by Related Genre (Inclusive)')
fig.update_layout(height=500, bargap=0.1)
fig.show()
fig = px.bar(df['Subgenres'].explode().value_counts().reset_index()[:20],
x='Subgenres',
y='count',
title='Number of Books by Top 20 Subgenres')
fig.update_layout(height=500)
fig.show()
limited_books = df.groupby(["Genre", "Limited Stock"]
).size().reset_index(name='count')
fig = px.bar(limited_books,
x='Genre',
y='count',
color='Limited Stock',
title='Genres by Limited Stock',
barmode='group')
fig.update_layout(height=500)
fig.show()
top_authors = df["Author"].value_counts().reset_index()
top_authors
| Author | count | |
|---|---|---|
| 0 | Unassigned | 25 |
| 1 | Jeff Kinney | 24 |
| 2 | Hergé | 23 |
| 3 | Thich Nhat Hanh | 16 |
| 4 | Kentaro Miura | 16 |
| ... | ... | ... |
| 1337 | Yoshitoki Oima | 1 |
| 1338 | Luo Di Cheng Qiu | 1 |
| 1339 | Jim Starlin | 1 |
| 1340 | Negi Haruba | 1 |
| 1341 | Peter Matthiessen | 1 |
1342 rows × 2 columns
fig = px.bar(top_authors[1:11],
x='Author',
y='count',
title='Top Authors by Number of Books')
fig.update_xaxes(title_text="Authors")
fig.update_yaxes(title_text="Number of Books")
fig.show()
fig = px.bar(df.sort_values(by='Price', ascending=False)[:10],
x='Title',
y='Price',
title='Top 10 Most Expensive Books',
color='Title')
fig.update_layout(height=500)
fig.update_xaxes(showticklabels=False)
fig.show()
Pie charts¶
fig = px.pie(df["Language"].value_counts().reset_index(),
values='count',
names='Language',
hole=0.4,
title='Distribution of Books by Language')
fig.update_layout(height=600)
fig.show()
fig = px.pie(df["Genre"].value_counts().reset_index(),
values='count',
names='Genre',
title='Distribution of Books by Genres',
hole=0.4,
color_discrete_sequence=px.colors.qualitative.Set1)
fig.update_layout(height=600)
fig.show()
Scatter Plots¶
fig = px.scatter(df,
x='Price',
y='Discount Amount',
title='Price vs. Discount Amount',
color='Discount',
hover_data=["Title", "Limited Stock", "Discount"])
fig.update_traces(marker=dict(size=5),
selector=dict(mode='markers'))
fig.show()
fig = px.scatter(df,
x='Weight',
y='Price',
log_y=True,
title='Price vs. Weight',
color='Limited Stock',
hover_data=["Limited Stock", "Genre"])
fig.update_traces(marker=dict(size=5, line=dict(width=0.4, color='DarkSlateGray')))
fig.show()
fig = px.scatter(no_outlier_df,
x='Weight',
y='Price',
log_y=True,
title='Price vs. Weight (No Outliers)',
color='Limited Stock',
hover_data=["Limited Stock", "Genre"])
fig.update_traces(marker=dict(size=5, line=dict(width=0.4, color='DarkSlateGray')))
fig.show()
Box Plots¶
fig = px.violin(df,
x='Genre',
y='Price',
color="Genre",
log_y=True,
title='Prices Distribution by Genre',)
fig.update_layout(height=500, showlegend=False)
fig.show()
fig = px.box(df,
x='Genre',
y='Number of Pages',
color='Genre',
log_y=True,
title="Page Count Distribution by Genre")
fig.update_layout(height=500, showlegend=False)
fig.show()
Bubble Chart¶
fig = px.scatter(avg_data,
x='Average Price',
y='Average Page Count',
size='Number of Books',
color='Genre',
title="Genre Analysis: Price vs. Page Count",
color_discrete_sequence=px.colors.qualitative.Set1)
fig.update_layout(height=600)
fig.update_xaxes(title_text="Average Price")
fig.update_yaxes(title_text="Average Page Count")
fig.update_traces(marker=dict(line=dict(width=1,
color='DarkSlateGrey')),
selector=dict(mode='markers'))
fig.show()
fig = px.scatter(avg_data,
x='Number of Books',
y='Average Price',
color='Genre',
size='Average Price',
title='Book Count vs Price')
fig.update_layout(height=500)
fig.update_traces(marker=dict(line=dict(width=1,
color='DarkSlateGrey')),
selector=dict(mode='markers'))
fig.show()
fig = px.scatter_3d(avg_data,
x='Average Price',
y='Number of Books',
z='Average Page Count',
color='Genre',
title="3D Scatter Plot of Price, Book Count, and Page Count")
fig.update_layout(height=600)
fig.update_traces(marker=dict(size=4,
line=dict(width=1,
color='DarkSlateGrey')),
selector=dict(mode='markers'))
fig.show()
Correlation¶
# correlations here are weak, but included for completeness
correlation = df[['Price', 'Number of Pages',
                  'Weight', 'Limited Stock']].corr()
fig = px.imshow(correlation,
text_auto=True,
color_continuous_scale='Sunsetdark',
title='Correlation Heatmap')
fig.update_layout(
    xaxis_title='Features',
    yaxis_title='Features',
    height=600,
    width=800
)
fig.show()
Word Cloud¶
text = ' '.join(synopsis for synopsis in df["Synopsis"].dropna())
custom_stopwords = set(STOPWORDS)
custom_stopwords.update(["book", "author", "words", "common", "u"])
wordcloud = WordCloud(stopwords=custom_stopwords,
width=800,
height=400,
background_color='white',
colormap='viridis').generate(text)
fig = px.imshow(wordcloud.to_array())
fig.update_layout(height=500)
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()
Modeling and Machine Learning¶
Books with an unknown ("Unassigned") author distort the recommendation system, since author information contributes to similarity; those rows are dropped before modeling.
df = df[df["Author"] != "Unassigned"]
df.columns
Index(['Title', 'Author', 'Price', 'List Price', 'Discount Amount',
'Limited Stock', 'Discount', 'Genre', 'Number of Pages', 'Weight',
'ISBN', 'Language', 'Related Genres', 'Subgenres', 'Synopsis', 'URL'],
dtype='object')
df.drop_duplicates(subset=["ISBN"], keep='last', inplace=True)
Genres and Subgenres¶
df[["Related Genres", "Subgenres"]]
| Related Genres | Subgenres | |
|---|---|---|
| 0 | [Arts And Photography, Self Improvement And Re... | [Music, Self Help, Psychology] |
| 1 | [Arts And Photography, Nepali] | [Picture Books, Books on Nepal] |
| 2 | [History Biography And Social Science, Arts An... | [Biography, Memoir, Art] |
| 3 | [Arts And Photography, Learning And Reference] | [Architecture, Science] |
| 4 | [History Biography And Social Science, Arts An... | [History, Design, Art, Science] |
| ... | ... | ... |
| 1937 | [History Biography And Social Science] | [Memoir] |
| 1938 | [Travel, Nepali] | [Climbing and Mountaineering, Books on Nepal] |
| 1939 | [Fiction And Literature] | [Classics, Contemporary] |
| 1940 | [History Biography And Social Science, Busines... | [Memoir, Biography, Business] |
| 1941 | [History Biography And Social Science] | [Autobiography] |
1913 rows × 2 columns
unique_subgenres = []
max_subgenre_len = 0
# Subgenres is guaranteed to hold lists after clean_and_convert_to_list
for subgenres in df["Subgenres"]:
    max_subgenre_len = max(max_subgenre_len, len(subgenres))
    for subgenre in subgenres:
        if subgenre not in unique_subgenres:
            unique_subgenres.append(subgenre)
print(unique_subgenres)
print(len(unique_subgenres))
print("Max Len: ", max_subgenre_len)
['Music', 'Self Help', 'Psychology', 'Picture Books', 'Books on Nepal', 'Biography', 'Memoir', 'Art', 'Architecture', 'Science', 'History', 'Design', 'Business', 'Stress Management', 'Philosophy', 'Ages 3 to 5', 'Coloring Books', 'Management', 'Leadership', 'Ages 6 to 8', 'Childrens', 'Action and Adventure', 'Fashion', 'Fantasy', 'Romance', 'Young Adult', 'Poetry and Prose', 'Autobiography', 'Mindfulness', 'Photography and Filmmaking', 'Science Fiction', 'Contemporary', 'Humor', 'Classics', 'Economics', 'Sociology', 'Ages 9 to 12', 'Card Games', 'Short Story', 'Buddhism', 'Productivity', 'Time Management', 'Finance', 'Biology', 'Investing', 'Feminism', 'Marketing and Sales', 'Politics', 'Money', 'Asian Literature', 'Communication and Social Skills', 'Mental Health', 'Japanese Literature', 'Adult Fiction', 'Drama', 'Military Fiction', 'Historical Fiction', 'Womens Fiction', 'LGBTQIA+', 'Mystery', 'Thriller and Suspense', 'Crime', 'Coming of Age', 'Horror', 'Chick lit', 'French', 'Hindi', 'Hinduism', 'Japanese', 'Russian', 'German', 'Osho', 'Chinese', 'Language', 'Linguistics and Writing', 'Language Books', 'Society and Culture', 'True Crime', 'Anthology', 'Medicine', 'Baby to 2', 'Teens and Young Adult', 'Children Activities and Crafts', 'Nepali Language', 'Nepali Children Book', 'Coloring for Children', 'Parenting and Relationships', 'Neuroscience', 'Dictionaries', 'Puzzles', 'Geography', 'Current Affairs', 'Mathematics', 'Motivational', 'Health', 'Food and Drinks', 'Football', 'Sports', 'Meditation and Yoga', 'Pregnancy and Childbirth', 'Card Decks and Oracles', 'Healing', 'Cookbooks', 'Diary', 'Journal', 'Quotes', 'Mythology', 'Tarot', 'Comics', 'Manga', 'Graphic Novels', 'Books on Tibet', 'Books of Bangladesh', 'Books on India', 'Environment', 'Trees and Plants', 'Encyclopedias', 'Animals and Pets', 'Gems and Jewelleries', 'Climbing and Mountaineering', 'Travel Guide Books', 'Astrology', 'Books On Himalayas', 'Nepali Literature', 'Paranormal', 'Islam', 'Modern Classic', 'Christianity', 'Anthropology', 'Sex', 'Computers and Internet', 'Artificial Intelligence', 'BlockChain Technology', 'Programming', 'Engineering', 'Law', 'Journalism', 'Atlas', 'British Literature']
139
Max Len:  8
max_relgenres_len = 0
for genres in df["Related Genres"]:
if len(genres) > max_relgenres_len:
max_relgenres_len = len(genres)
print("Max Related Genres Length: ", max_relgenres_len)
Max Related Genres Length: 5
df["Related Genres"].explode().value_counts()
Related Genres
History Biography And Social Science    546
Fiction And Literature                  411
Self Improvement And Relationships      309
Spirituality And Philosophy             268
Business And Investing                  261
Learning And Reference                  226
Kids And Teens                          175
Arts And Photography                    160
Manga And Graphic Novels                154
Nepali                                  152
Lifestyle And Wellness                  142
Religion                                135
Miscellaneous                           113
Nature                                  112
Travel                                   90
Technology                               71
Foreign Languages                        46
Rare Coffee Table Books                   5
Name: count, dtype: int64
df["Subgenres"].explode().value_counts()
Subgenres
Self Help 279
Philosophy 239
Business 236
History 199
Science 188
...
Geography 1
Astrology 1
Gems and Jewelleries 1
Encyclopedias 1
British Literature 1
Name: count, Length: 139, dtype: int64
Encoding¶
mlb_related = MultiLabelBinarizer()
related_genres_encoded = mlb_related.fit_transform(df["Related Genres"])
mlb_subgenres = MultiLabelBinarizer()
subgenres_encoded = mlb_subgenres.fit_transform(df["Subgenres"])
display(pd.DataFrame(
related_genres_encoded, columns=mlb_related.classes_))
| Arts And Photography | Business And Investing | Fiction And Literature | Foreign Languages | History Biography And Social Science | Kids And Teens | Learning And Reference | Lifestyle And Wellness | Manga And Graphic Novels | Miscellaneous | Nature | Nepali | Rare Coffee Table Books | Religion | Self Improvement And Relationships | Spirituality And Philosophy | Technology | Travel | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1908 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1909 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1910 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1911 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1912 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1913 rows × 18 columns
display(pd.DataFrame(subgenres_encoded, columns=mlb_subgenres.classes_))
| Action and Adventure | Adult Fiction | Ages 3 to 5 | Ages 6 to 8 | Ages 9 to 12 | Animals and Pets | Anthology | Anthropology | Architecture | Art | ... | Stress Management | Tarot | Teens and Young Adult | Thriller and Suspense | Time Management | Travel Guide Books | Trees and Plants | True Crime | Womens Fiction | Young Adult | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1908 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1909 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1910 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1911 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1912 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1913 rows × 139 columns
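For intuition, `MultiLabelBinarizer` turns variable-length label lists into a fixed-width binary matrix, one column per distinct label (the toy labels below are made up):

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform([["Fantasy", "Romance"], ["Fantasy"], []])
print(list(mlb.classes_))  # ['Fantasy', 'Romance']
print(encoded.tolist())    # [[1, 1], [1, 0], [0, 0]]
```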
Synopsis Embeddings¶
Using paraphrase-multilingual-MiniLM-L12-v2 from SentenceTransformers to encode synopses.
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
df.reset_index(inplace=True, drop=True)
synopsis_embeddings = model.encode(df["Synopsis"], show_progress_bar=True)
Combine and Add Weights¶
weights = {
"genre_weight": 0.2, # related genres weight
"subgenre_weight": 0.1,
"synopsis_weight": 0.4
}
synopsis_embeddings_matrix = np.vstack(synopsis_embeddings)
weighted_related_genres = weights["genre_weight"] * related_genres_encoded
weighted_subgenres = weights["subgenre_weight"] * subgenres_encoded
weighted_synopses = weights["synopsis_weight"] * synopsis_embeddings_matrix
# scaling the synopsis (dense matrix)
scaler = StandardScaler(with_mean=False)
synopsis_embeddings_scaled = scaler.fit_transform(weighted_synopses)
combined_weighted_features = hstack([
csr_matrix(weighted_related_genres),
csr_matrix(weighted_subgenres),
csr_matrix(synopsis_embeddings_scaled),
])
print(combined_weighted_features)
<Compressed Sparse Row sparse matrix of dtype 'float64'
    with 742520 stored elements and shape (1913, 541)>
  Coords    Values
  (0, 0)    0.2
  (0, 4)    0.2
  (0, 14)   0.2
  (0, 116)  0.1
  (0, 134)  0.1
  (0, 141)  0.1
  (0, 157)  1.784592866897583
  (0, 158)  -0.4816299080848694
  (0, 159)  -0.26151272654533386
  (0, 160)  -0.4406612813472748
  (0, 161)  -1.638753890991211
  (0, 162)  0.08578472584486008
  (0, 163)  1.0713781118392944
  (0, 164)  0.6499630808830261
  (0, 165)  0.9495714902877808
  (0, 166)  0.268752783536911
  (0, 167)  0.04838496446609497
  (0, 168)  0.03733789920806885
  (0, 169)  0.5523092150688171
  (0, 170)  -2.791269063949585
  (0, 171)  1.151781678199768
  (0, 172)  0.28092601895332336
  (0, 173)  0.03787020221352577
  (0, 174)  1.4218069314956665
  (0, 175)  -1.508204460144043
  :    :
  (1912, 516)  -0.4690714180469513
  (1912, 517)  1.0671030282974243
  (1912, 518)  0.30465206503868103
  (1912, 519)  1.0742019414901733
  (1912, 520)  2.244892120361328
  (1912, 521)  0.0510968416929245
  (1912, 522)  -8.167990017682314e-05
  (1912, 523)  -0.5710775256156921
  (1912, 524)  -0.15032102167606354
  (1912, 525)  0.027843547984957695
  (1912, 526)  2.2979421615600586
  (1912, 527)  -0.9206590056419373
  (1912, 528)  1.8066314458847046
  (1912, 529)  0.3503602147102356
  (1912, 530)  -2.2840702533721924
  (1912, 531)  -0.3761787712574005
  (1912, 532)  -0.22271381318569183
  (1912, 533)  0.24022667109966278
  (1912, 534)  -1.2844871282577515
  (1912, 535)  -2.121735095977783
  (1912, 536)  0.15947629511356354
  (1912, 537)  -0.06682594120502472
  (1912, 538)  -0.6463495492935181
  (1912, 539)  -2.1301190853118896
  (1912, 540)  1.453000545501709
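`hstack` concatenates the sparse blocks column-wise, so the combined width is the sum of the block widths; a toy check of the shape arithmetic (the row count of 3 is illustrative):

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix

genres = csr_matrix(np.ones((3, 18)))      # 18 related-genre columns
subgenres = csr_matrix(np.ones((3, 139)))  # 139 subgenre columns
synopsis = csr_matrix(np.ones((3, 384)))   # 384-dim MiniLM embeddings
combined = hstack([genres, subgenres, synopsis])
print(combined.shape)  # (3, 541)
```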
Calculate Similarity¶
Cosine Similarity¶
Cosine similarity measures the cosine of the angle between two non-zero vectors in a multi-dimensional space, quantifying how similar the two vectors are irrespective of their magnitudes. It ranges from -1 (pointing in exactly opposite directions) to 1 (pointing in the same direction), with 0 indicating orthogonality (no similarity).
$$ \text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} $$
where,
- $A \cdot B$ is the dot product of two vectors, $A$ and $B$
- $\|A\|$ and $\|B\|$ are the magnitude of the vectors $A$ and $B$
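A minimal numeric check of the formula, with vectors chosen to hit the boundary cases:

```python
import numpy as np

def cosine_sim(a, b):
    """cos(A, B) = A.B / (|A| |B|)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction, twice the magnitude
c = np.array([-1.0, -2.0, -3.0])  # opposite direction

print(round(cosine_sim(a, b), 6))  # 1.0  -> magnitude is ignored
print(round(cosine_sim(a, c), 6))  # -1.0 -> opposite vectors
```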
similarity_matrix = cosine_similarity(combined_weighted_features)
similarity_df = pd.DataFrame(
similarity_matrix, index=df['Title'], columns=df['Title'])
similarity_df.head()
| Title | The Inner Game of Music | The Nepalis a pictorial celebration | Lust for Life | The Architecture Book | The World According to Colour | Design Your Thinking | Wabi Sabi : The Wisdom In Imperfection | The Art Book | Colouring book : Copy Colour Fruits and Vegetables | Creativity, Inc.: Overcoming the Unseen Forces That Stand in the Way of True Inspiration | ... | The Time Keeper | The Climb | The Life and Times of the Thunderbolt Kid | Storms of Silence | Annapurna | A glimpse of eternal snows | Tourism in Pokhara | Satori in Paris | Leaving Microsoft to Change the World | The Snow Leopard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Title | |||||||||||||||||||||
| The Inner Game of Music | 1.000000 | 0.034147 | 0.176279 | 0.201509 | 0.278708 | 0.300259 | 0.230279 | 0.251498 | 0.125580 | 0.124834 | ... | 0.098426 | 0.137469 | 0.144421 | 0.244405 | 0.050204 | 0.138618 | 0.170119 | 0.139562 | 0.183268 | 0.137902 |
| The Nepalis a pictorial celebration | 0.034147 | 1.000000 | 0.118702 | 0.082047 | 0.301799 | 0.136164 | 0.195004 | 0.358356 | 0.129706 | 0.165313 | ... | 0.088339 | 0.186285 | 0.170930 | 0.075160 | 0.436631 | 0.332246 | 0.399275 | 0.275086 | 0.219800 | 0.233057 |
| Lust for Life | 0.176279 | 0.118702 | 1.000000 | 0.264988 | 0.326018 | 0.110824 | 0.163250 | 0.371939 | 0.253391 | 0.257954 | ... | 0.250244 | 0.404889 | 0.427070 | 0.278486 | 0.144082 | 0.252875 | 0.325080 | 0.380746 | 0.283003 | 0.205666 |
| The Architecture Book | 0.201509 | 0.082047 | 0.264988 | 1.000000 | 0.318694 | 0.245561 | 0.203262 | 0.488973 | 0.211905 | 0.159666 | ... | 0.179159 | 0.259368 | 0.222482 | 0.132534 | 0.077794 | 0.236956 | 0.293476 | 0.221182 | 0.191667 | 0.180547 |
| The World According to Colour | 0.278708 | 0.301799 | 0.326018 | 0.318694 | 1.000000 | 0.211711 | 0.236022 | 0.417661 | 0.487917 | 0.188843 | ... | 0.220100 | 0.126099 | 0.304179 | 0.184394 | 0.093187 | 0.327152 | 0.314607 | 0.312270 | 0.206742 | 0.293267 |
5 rows × 1913 columns
type(similarity_matrix)
numpy.ndarray
Spectral Clustering¶
db_scores = []
cluster_range = range(2, 20)
for n_clusters in cluster_range:
spectral = SpectralClustering(n_clusters, affinity='precomputed', random_state=42)
labels = spectral.fit_predict(similarity_matrix)
score = davies_bouldin_score(similarity_matrix, labels)
db_scores.append(score)
print(db_scores)
[2.310745480641241, 2.6262955249402484, 2.0825395549022927, 2.3150347844418984, 2.131239058192411, 2.238069358487908, 2.2495851492530963, 2.3740701938440525, 2.4610600540217056, 2.5060761997933674, 2.517114030275109, 2.6979313709881563, 2.3956730612528703, 2.411985376934093, 2.8241969883224582, 2.538603845476868, 2.405740148752897, 2.542586944219207]
fig = px.line(x=cluster_range,
              y=db_scores,
              title="Davies-Bouldin Index by Number of Clusters (lower is better)")
fig.update_layout(xaxis_title="Number of Clusters",
                  yaxis_title="Davies-Bouldin Score")
fig.show()
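Lower Davies-Bouldin scores indicate better-separated clusters, so the best candidate k can be read off with `argmin` (the scores below are rounded examples of the first few printed above):

```python
import numpy as np

db_scores = [2.31, 2.63, 2.08, 2.33, 2.13]    # rounded example scores
cluster_range = range(2, 2 + len(db_scores))  # the k each score belongs to
best_k = cluster_range[int(np.argmin(db_scores))]  # lower DB index is better
print(best_k)  # 4
```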
n_clusters = 4
spectral_clustering = SpectralClustering(n_clusters=n_clusters,
affinity='precomputed',
assign_labels='kmeans',
random_state=42)
labels = spectral_clustering.fit_predict(similarity_matrix)
umap_reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42, metric='cosine')
embedding = umap_reducer.fit_transform(similarity_matrix)
fig = px.scatter(x=embedding[:, 0],
y=embedding[:, 1],
color=labels.astype(str),
title='Spectral Clustering of Books',
labels={'color': 'Clusters'})
fig.update_layout(xaxis_title="UMAP 1",
yaxis_title="UMAP 2",
height=500)
fig.show()
df["Spectral Cluster"] = labels
def get_recommendations(similarity_df: pd.DataFrame,
df: pd.DataFrame,
title: str,
n: int = 5,
columns: list[str] = [
"Title", "Author", "Genre", "Related Genres", "Subgenres", "Spectral Cluster"]
) -> pd.DataFrame:
idx = similarity_df.index.get_loc(title)
sim_scores = list(enumerate(similarity_df.iloc[idx]))
# sort the books based on the similarity scores
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
# exclude the first one as it's the book itself
sim_scores = sim_scores[1:n+1]
# get the book indices
book_indices = [i[0] for i in sim_scores]
display(df[df["Title"] == title][columns])
# top n most similar books
return df[columns].iloc[book_indices]
get_recommendations(similarity_df,
df,
"Diary of a Wimpy Kid",
n=10)
| Title | Author | Genre | Related Genres | Subgenres | Spectral Cluster | |
|---|---|---|---|---|---|---|
| 382 | Diary of a Wimpy Kid | Jeff Kinney | Fiction And Literature | [Fiction And Literature, Kids And Teens] | [Humor, Ages 9 to 12] | 0 |
| Title | Author | Genre | Related Genres | Subgenres | Spectral Cluster | |
|---|---|---|---|---|---|---|
| 339 | Diary Of A Wimpy Kid ; The Ugly Truth | Jeff Kinney | Kids And Teens | [Kids And Teens] | [Ages 9 to 12] | 0 |
| 781 | Diary Of A Wimpy Kid: No Brainer | Jeff Kinney | Kids And Teens | [Kids And Teens, Fiction And Literature, Manga... | [Childrens, Humor, Graphic Novels] | 0 |
| 360 | Diary of a Wimpy Kid: Wrecking Ball | Jeff Kinney | Kids And Teens | [Kids And Teens] | [Ages 9 to 12] | 0 |
| 353 | Cabin Fever | Jeff Kinney | Kids And Teens | [Kids And Teens] | [Ages 9 to 12] | 0 |
| 735 | Diary of an Awesome Friendly Kid: Rowley Jeffe... | Jeff Kinney | Fiction And Literature | [Fiction And Literature, Kids And Teens, Manga... | [Humor, Young Adult, Childrens, Graphic Novels... | 0 |
| 656 | DIARY OF A WIMPY KID: THE GETAWAY | Jeff Kinney | Fiction And Literature | [Fiction And Literature, Kids And Teens, Manga... | [Humor, Childrens, Graphic Novels] | 0 |
| 357 | Rodrick Rules | Jeff Kinney | Kids And Teens | [Kids And Teens] | [Ages 6 to 8, Ages 9 to 12] | 0 |
| 431 | Diary Of A Wimpy Kid ; Hard Luck | Jeff Kinney | Kids And Teens | [Kids And Teens] | [Ages 9 to 12] | 0 |
| 369 | Diary of a Wimpy Kid 10. Old School | Jeff Kinney | Fiction And Literature | [Fiction And Literature, Kids And Teens] | [Young Adult, Humor, Ages 9 to 12] | 0 |
| 424 | Diary of a Wimpy Kid: Diper Overlode (Book 17) | Jeff Kinney | Kids And Teens | [Kids And Teens] | [Ages 9 to 12] | 0 |
get_recommendations(similarity_df, df, "Satori in Paris")
| Title | Author | Genre | Related Genres | Subgenres | Spectral Cluster | |
|---|---|---|---|---|---|---|
| 1910 | Satori in Paris | Jack Kerouac | Fiction And Literature | [Fiction And Literature] | [Classics, Contemporary] | 3 |
| Title | Author | Genre | Related Genres | Subgenres | Spectral Cluster | |
|---|---|---|---|---|---|---|
| 1870 | Lonesome Traveler | Jack Kerouac | Fiction And Literature | [Fiction And Literature] | [Classics] | 0 |
| 1602 | Metamorphosis and Other Stories | Franz Kafka | Fiction And Literature | [Fiction And Literature, Spirituality And Phil... | [Short Story, Classics, Philosophy] | 0 |
| 1553 | Metamorphosis and Other Stories | Franz Kafka | Fiction And Literature | [Fiction And Literature, Spirituality And Phil... | [Modern Classic, Philosophy] | 0 |
| 1149 | Ijajatpatra | Sarthak Karki | Nepali | [Nepali] | [Nepali Literature] | 0 |
| 271 | Sakshi Chetna: Amrita Pritam | Rajesh Chandra | History Biography And Social Science | [History Biography And Social Science, Foreign... | [Memoir, Hindi] | 3 |
K-Nearest Neighbors¶
Model¶
k = 15 # number of recommendations (neighbors)
# KNN using cosine distance
knn = NearestNeighbors(n_neighbors=k, metric='cosine', algorithm='brute')
knn.fit(combined_weighted_features)
NearestNeighbors(algorithm='brute', metric='cosine', n_neighbors=15)
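The `Distance` values reported below are cosine distances (1 − cosine similarity); a minimal sketch of `NearestNeighbors` with the cosine metric on toy vectors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[1.0, 0.0],
              [0.9, 0.1],   # almost the same direction as row 0
              [0.0, 1.0]])  # orthogonal to row 0
nn = NearestNeighbors(n_neighbors=2, metric='cosine', algorithm='brute').fit(X)
distances, indices = nn.kneighbors(X[[0]])
print(indices[0].tolist())  # [0, 1] -> the point itself first, then the near-duplicate
```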
def recommend_knn(book_title,
df,
knn_model,
combined_features_scaled,
n_recommendations=5,
columns=['Title', 'Author', 'Related Genres', 'Subgenres', 'Spectral Cluster'],
verbose=True):
try:
book_idx = df[df['Title'] == book_title].index[0]
except IndexError:
return f'Book "{book_title}" not found.'
# get the vector for the book
book_vector = combined_features_scaled[book_idx].reshape(1, -1)
# find k nearest neighbors (including the book itself)
distances, indices = knn_model.kneighbors(
book_vector, n_neighbors=n_recommendations+1)
# get indices of the recommended books (first one is the book itself)
recommended_indices = indices[0][1:]
recommended_distances = distances[0][1:]
# create a df with recommended books and distances
recommendations_df = df.iloc[recommended_indices].copy()
recommendations_df['Distance'] = recommended_distances
if verbose:
display(df[df["Title"] == book_title][columns])
return recommendations_df[columns + ["Distance"]]
recommend_knn("Diary of a Wimpy Kid", df, knn, combined_weighted_features)
| Title | Author | Related Genres | Subgenres | Spectral Cluster | |
|---|---|---|---|---|---|
| 382 | Diary of a Wimpy Kid | Jeff Kinney | [Fiction And Literature, Kids And Teens] | [Humor, Ages 9 to 12] | 0 |
| Title | Author | Related Genres | Subgenres | Spectral Cluster | Distance | |
|---|---|---|---|---|---|---|
| 339 | Diary Of A Wimpy Kid ; The Ugly Truth | Jeff Kinney | [Kids And Teens] | [Ages 9 to 12] | 0 | 0.315828 |
| 781 | Diary Of A Wimpy Kid: No Brainer | Jeff Kinney | [Kids And Teens, Fiction And Literature, Manga... | [Childrens, Humor, Graphic Novels] | 0 | 0.331757 |
| 360 | Diary of a Wimpy Kid: Wrecking Ball | Jeff Kinney | [Kids And Teens] | [Ages 9 to 12] | 0 | 0.352462 |
| 353 | Cabin Fever | Jeff Kinney | [Kids And Teens] | [Ages 9 to 12] | 0 | 0.363305 |
| 735 | Diary of an Awesome Friendly Kid: Rowley Jeffe... | Jeff Kinney | [Fiction And Literature, Kids And Teens, Manga... | [Humor, Young Adult, Childrens, Graphic Novels... | 0 | 0.383408 |
recommend_knn("Jay Vudi", df, knn, combined_weighted_features, n_recommendations=10)
| Title | Author | Related Genres | Subgenres | Spectral Cluster | |
|---|---|---|---|---|---|
| 1151 | Jay Vudi | Bhairav Aryal | [Nepali] | [Nepali Literature] | 3 |
| Title | Author | Related Genres | Subgenres | Spectral Cluster | Distance | |
|---|---|---|---|---|---|---|
| 1140 | Damini Bhir | Rajan Mukarung | [Nepali] | [Nepali Literature, Nepali Language] | 3 | 0.389024 |
| 1093 | Ghatmandu | Kumar Nagarkoti | [Nepali] | [Nepali Literature, Nepali Language] | 3 | 0.399716 |
| 1149 | Ijajatpatra | Sarthak Karki | [Nepali] | [Nepali Literature] | 0 | 0.402148 |
| 1867 | Mountains painted with turneric | Lil Bahadur Chettri | [Travel] | [Climbing and Mountaineering] | 3 | 0.416109 |
| 195 | Arresting god in kathmandu | Samrat Upadhyay | [Fiction And Literature] | [Short Story, Contemporary] | 3 | 0.422693 |
| 1084 | Lato Pahad | Upendra Subba | [Nepali] | [Nepali Literature] | 3 | 0.424879 |
| 224 | Karnali Blues | Buddhisagar, Michael Hutt (Translator) | [Fiction And Literature] | [Asian Literature, Contemporary] | 3 | 0.430533 |
| 524 | Ratna's basic Nepali dictionary | Shyam P. Wagley, Bijay Kumar Rauniyar | [Learning And Reference] | [Dictionaries] | 3 | 0.431791 |
| 1122 | Kumari Prashnaharu | Durga Karki | [Nepali] | [Nepali Literature] | 3 | 0.459874 |
| 1139 | Nepalese Folklore: Kirati Tales | Shiva Kumar Sheratha | [Nepali] | [Books on Nepal] | 3 | 0.474343 |
recommend_knn("Satori in Paris", df, knn, combined_weighted_features, n_recommendations=15)
| Title | Author | Related Genres | Subgenres | Spectral Cluster | |
|---|---|---|---|---|---|
| 1910 | Satori in Paris | Jack Kerouac | [Fiction And Literature] | [Classics, Contemporary] | 3 |
| Title | Author | Related Genres | Subgenres | Spectral Cluster | Distance | |
|---|---|---|---|---|---|---|
| 1870 | Lonesome Traveler | Jack Kerouac | [Fiction And Literature] | [Classics] | 0 | 0.398846 |
| 1602 | Metamorphosis and Other Stories | Franz Kafka | [Fiction And Literature, Spirituality And Phil... | [Short Story, Classics, Philosophy] | 0 | 0.425701 |
| 1553 | Metamorphosis and Other Stories | Franz Kafka | [Fiction And Literature, Spirituality And Phil... | [Modern Classic, Philosophy] | 0 | 0.425715 |
| 1149 | Ijajatpatra | Sarthak Karki | [Nepali] | [Nepali Literature] | 0 | 0.448904 |
| 271 | Sakshi Chetna: Amrita Pritam | Rajesh Chandra | [History Biography And Social Science, Foreign... | [Memoir, Hindi] | 3 | 0.466396 |
| 821 | India My Love | Dominique Lapierre | [Miscellaneous] | [Books on India] | 3 | 0.479485 |
| 1493 | Dharmayoddha Kalki | Kevin Missal | [Fiction And Literature, Spirituality And Phil... | [Fantasy, Mythology] | 3 | 0.485310 |
| 855 | First Person Singular | Haruki Murakami | [Miscellaneous] | [] | 0 | 0.486236 |
| 1128 | Nun Tel | Jeevan Chhetri | [Nepali] | [Nepali Literature, Nepali Language] | 3 | 0.488469 |
| 1151 | Jay Vudi | Bhairav Aryal | [Nepali] | [Nepali Literature] | 3 | 0.489157 |
| 1825 | The Two-Year Mountain | Phil Deutschle | [Travel] | [Travel Guide Books, Climbing and Mountaineering] | 3 | 0.494712 |
| 1819 | The Journey Home: Autobiography of an American... | Radhanath Swami | [Spirituality And Philosophy, History Biograph... | [Philosophy, Biography, Memoir, Autobiography,... | 3 | 0.500038 |
| 1142 | Loo | Nayan Raj Pandey | [Nepali] | [Nepali Literature] | 3 | 0.502328 |
| 282 | Alchemist (Hindi) | Paul Cornell | [Fiction And Literature, Foreign Languages] | [Fantasy, Hindi] | 0 | 0.506093 |
| 1120 | Aja Ramita Chha | Indra Bahadur Rai | [Nepali] | [Nepali Literature] | 3 | 0.506831 |
Recommendation Visualizations¶
recommendations = recommend_knn("The Bell Jar",
df,
knn,
combined_weighted_features,
n_recommendations=20,
verbose=False)
print(recommendations.columns)
Index(['Title', 'Author', 'Related Genres', 'Subgenres', 'Spectral Cluster',
'Distance'],
dtype='object')
Distribution¶
rec_genres = recommendations["Related Genres"].explode().value_counts().reset_index()
rec_subgenres = recommendations["Subgenres"].explode().value_counts().reset_index()
genre_dist_plot = make_subplots(cols=1, rows=2,
subplot_titles=('Distribution of Recommendations by Genre',
'Distribution of Recommendations by Subgenres'),
specs=[[{'type': 'pie'}], [{'type': 'pie'}]])
genre_dist_plot.add_trace(go.Pie(values=rec_genres['count'],
labels=rec_genres['Related Genres'],
name='Related Genres',
hole=0.4,
legendgroup='genre',
showlegend=True),
row=1,
col=1)
genre_dist_plot.add_trace(go.Pie(values=rec_subgenres['count'],
labels=rec_subgenres['Subgenres'],
name='Subgenres',
hole=0.4,
legendgroup='subgenre',
showlegend=True),
row=2,
col=1)
genre_dist_plot.update_layout(
height=800,
title_text="Recommendations Genre and Subgenre Distribution - 'The Bell Jar'",
)
genre_dist_plot.update_traces(
legendgroup="genre",
showlegend=True,
row=1, col=1
)
genre_dist_plot.update_traces(
legendgroup="subgenre",
showlegend=True,
row=2, col=1
)
genre_dist_plot.show()
Distance¶
book_title = "The Bell Jar"
umap_model = umap.UMAP(n_neighbors=15, random_state=42)
umap_embeddings = umap_model.fit_transform(similarity_matrix)
recommended_idx = recommendations.index
# filter the UMAP embeddings to only include recommended books
umap_recommendations = umap_embeddings[recommended_idx]
umap_df = pd.DataFrame(umap_recommendations, columns=["UMAP1", "UMAP2"])
umap_df['Title'] = recommendations['Title'].values
# index of original book
original_book_idx = df[df['Title'] == book_title].index[0]
original_book_umap = umap_embeddings[original_book_idx].reshape(1, -1)
original_book_df = pd.DataFrame(original_book_umap, columns=["UMAP1", "UMAP2"])
original_book_df['Title'] = book_title
fig = go.Figure()
# Add lines from the original book to every recommended book
for idx, row in umap_df.iterrows():
fig.add_trace(go.Scatter(x=[original_book_df['UMAP1'].values[0], row['UMAP1']],
y=[original_book_df['UMAP2'].values[0], row['UMAP2']],
mode='lines',
line=dict(color='gray', width=0.5),
showlegend=False))
# Add the recommended books
fig.add_trace(go.Scatter(x=umap_df['UMAP1'],
y=umap_df['UMAP2'],
mode='markers',
text=umap_df['Title'],
name='Recommended',
marker=dict(color='#1F77B4', line=dict(width=1, color='DarkSlateGray'))))
# Add the original book trace
fig.add_trace(go.Scatter(x=original_book_df['UMAP1'],
y=original_book_df['UMAP2'],
mode='markers+text',
text=original_book_df['Title'],
name='Original Book',
marker=dict(color='red', size=12, symbol='x')))
fig.update_layout(title="UMAP Projection of Book Recommendations with Original Book", height=500)
fig.show()
Putting It All Together in a Function¶
def visualize_recommendations(book_title: str, n_recommendations: int = 10) -> None:
recommendations = recommend_knn(book_title,
df,
knn,
combined_weighted_features,
n_recommendations,
verbose=False)
# Pie Charts
rec_genres = recommendations["Related Genres"].explode().value_counts().reset_index()
rec_subgenres = recommendations["Subgenres"].explode().value_counts().reset_index()
genre_dist_plot = make_subplots(cols=1, rows=2,
subplot_titles=('Distribution of Recommendations by Genre',
'Distribution of Recommendations by Subgenres'),
specs=[[{'type': 'pie'}], [{'type': 'pie'}]])
genre_dist_plot.add_trace(go.Pie(values=rec_genres['count'],
labels=rec_genres['Related Genres'],
name='Related Genres',
hole=0.4,
legendgroup='genre',
showlegend=True),
row=1,
col=1)
genre_dist_plot.add_trace(go.Pie(values=rec_subgenres['count'],
labels=rec_subgenres['Subgenres'],
name='Subgenres',
hole=0.4,
legendgroup='subgenre',
showlegend=True),
row=2,
col=1)
genre_dist_plot.update_layout(
height=800,
title_text=f"Recommendations Genre and Subgenre Distribution - {book_title}",
)
genre_dist_plot.update_traces(
legendgroup="genre",
showlegend=True,
row=1, col=1
)
genre_dist_plot.update_traces(
legendgroup="subgenre",
showlegend=True,
row=2, col=1
)
genre_dist_plot.show()
# Cluster Distance
umap_model = umap.UMAP(n_neighbors=n_recommendations, min_dist=0.1, random_state=42, metric='cosine')
umap_embeddings = umap_model.fit_transform(similarity_matrix)
recommended_idx = recommendations.index
# filter the UMAP embeddings to only include recommended books
umap_recommendations = umap_embeddings[recommended_idx]
umap_df = pd.DataFrame(umap_recommendations, columns=["UMAP1", "UMAP2"])
umap_df['Title'] = recommendations['Title'].values
# index of original book
original_book_idx = df[df['Title'] == book_title].index[0]
original_book_umap = umap_embeddings[original_book_idx].reshape(1, -1)
original_book_df = pd.DataFrame(original_book_umap, columns=["UMAP1", "UMAP2"])
original_book_df['Title'] = book_title
fig = go.Figure()
# add lines from the original book to every recommended book
for idx, row in umap_df.iterrows():
fig.add_trace(go.Scatter(x=[original_book_df['UMAP1'].values[0], row['UMAP1']],
y=[original_book_df['UMAP2'].values[0], row['UMAP2']],
mode='lines',
line=dict(color='gray', width=0.5),
showlegend=False))
# Add the recommended books
fig.add_trace(go.Scatter(x=umap_df['UMAP1'],
y=umap_df['UMAP2'],
mode='markers',
text=umap_df['Title'],
name='Recommended',
marker=dict(color='#1F77B4', line=dict(width=1, color='DarkSlateGray'))))
# Add the original book trace
fig.add_trace(go.Scatter(x=original_book_df['UMAP1'],
y=original_book_df['UMAP2'],
mode='markers+text',
text=original_book_df['Title'],
name='Original Book',
marker=dict(color='red', size=12, symbol='x')))
fig.update_layout(title="UMAP Projection of Book Recommendations with Original Book", height=500)
fig.show()
visualize_recommendations(book_title="Diary of a Wimpy Kid")
Validation¶
Genre Diversity¶
We can measure the genre diversity of the recommendations using Simpson's Diversity Index.
Simpson's Diversity Index (SDI) is a measure of diversity that accounts for both the number of categories (e.g., genres) and the relative abundance of each category.
$$ D = 1 - \sum_i p_i^2 $$
where,
- $D$ is the Simpson's Diversity Index, ranging from 0 to 1 (values closer to 1 indicate higher diversity)
- $p_i$ is the proportion of category $i$ relative to the total
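As a quick worked example with toy numbers (not from the dataset), a 10-item recommendation list whose genres split 5/3/2 gives:

```python
# toy proportions for three genres: 5, 3, and 2 items out of 10
p = [0.5, 0.3, 0.2]
D = 1 - sum(x ** 2 for x in p)
print(round(D, 2))  # 0.62
```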
def diversity_score(recommendations, feature="Related Genres"):
feature_counts = recommendations[feature].explode().value_counts()
feature_proportions = feature_counts / feature_counts.sum()
display(feature_proportions)
# Simpson's Diversity Index
diversity = 1 - (feature_proportions ** 2).sum()
return diversity
diversity_score(
recommend_knn("The Bell Jar", df, knn, combined_weighted_features, n_recommendations=20)
)
| Title | Author | Related Genres | Subgenres | Spectral Cluster | |
|---|---|---|---|---|---|
| 307 | The Bell Jar | Sylvia Plath | [Fiction And Literature, History Biography And... | [Poetry and Prose, Classics, Psychology, Femin... | 0 |
| 316 | The Bell Jar | Sylvia Plath | [Fiction And Literature, History Biography And... | [Classics, Psychology, Feminism] | 0 |
Related Genres
Fiction And Literature                   0.484848
History Biography And Social Science     0.303030
Self Improvement And Relationships       0.090909
Spirituality And Philosophy              0.060606
Arts And Photography                     0.030303
Kids And Teens                           0.030303
Name: count, dtype: float64
0.6593204775022956
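The score above can be sanity-checked by recomputing the index directly from the printed proportions, which correspond to genre counts of 16, 10, 3, 2, 1, and 1 out of 33 entries (an inference from the proportions, not a value taken from the notebook):

```python
import numpy as np

# genre counts inferred from the printed proportions (counts / 33)
counts = np.array([16, 10, 3, 2, 1, 1])
p = counts / counts.sum()
print(1 - (p ** 2).sum())  # ≈ 0.6593, matching the score above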
diversity_score(
recommend_knn("Jay Vudi", df, knn, combined_weighted_features, n_recommendations=20)
)
| Title | Author | Related Genres | Subgenres | Spectral Cluster | |
|---|---|---|---|---|---|
| 1151 | Jay Vudi | Bhairav Aryal | [Nepali] | [Nepali Literature] | 3 |
Related Genres
Nepali                                   0.458333
Fiction And Literature                   0.208333
Spirituality And Philosophy              0.125000
Religion                                 0.083333
Travel                                   0.041667
Learning And Reference                   0.041667
History Biography And Social Science     0.041667
Name: count, dtype: float64
0.71875
diversity_score(
recommend_knn("Beautiful World, Where Are You",
df,
knn,
combined_weighted_features,
n_recommendations=20)
)
| Title | Author | Related Genres | Subgenres | Spectral Cluster | |
|---|---|---|---|---|---|
| 240 | Beautiful World, Where Are You | Sally Rooney | [Fiction And Literature] | [Contemporary, Romance] | 0 |
Related Genres
Fiction And Literature                   0.517241
History Biography And Social Science     0.137931
Kids And Teens                           0.103448
Foreign Languages                        0.068966
Spirituality And Philosophy              0.034483
Nature                                   0.034483
Self Improvement And Relationships       0.034483
Manga And Graphic Novels                 0.034483
Lifestyle And Wellness                   0.034483
Name: count, dtype: float64
0.6920332936979786
Intra-List Similarity¶
Intra-list similarity (ILS) is the average pairwise cosine similarity between the items within a recommendation list: lower values indicate a more diverse list, higher values a more homogeneous one.
def intra_list_similarity(recommendations, embedding_matrix):
total_similarity = 0
count = 0
for rec_indices in recommendations:
if len(rec_indices) < 2:
continue
list_embeddings = embedding_matrix[rec_indices].toarray()
# calculate cosine similarities within the list
similarities = cosine_similarity(list_embeddings)
# sum of lower triangle (to count unique pairs only)
total_similarity += np.tril(similarities, -1).sum()
count += (len(rec_indices) * (len(rec_indices) - 1)) / 2
ils = total_similarity / count if count > 0 else 0
return ils
book_titles = ["The Bell Jar", "Beautiful World, Where Are You", "Jay Vudi"]
recommendation_indices = []
for title in book_titles:
recs_df = recommend_knn(title, df, knn, combined_weighted_features, n_recommendations=k, verbose=False)
rec_indices = recs_df.index.to_list()
recommendation_indices.append(rec_indices) # Append to the main list
# Calculate ILS across all recommendation lists
ils_score = intra_list_similarity(recommendation_indices, combined_weighted_features)
print("Intra-List Similarity (ILS):", ils_score)
Intra-List Similarity (ILS): 0.45361794864451843
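As a sanity check on the metric itself (restating the function above on toy inputs), identical embeddings should give an ILS of 1 and orthogonal embeddings an ILS of 0:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

def intra_list_similarity(recommendations, embedding_matrix):
    total_similarity, count = 0.0, 0
    for rec_indices in recommendations:
        if len(rec_indices) < 2:
            continue
        # pairwise cosine similarities within one recommendation list
        sims = cosine_similarity(embedding_matrix[rec_indices].toarray())
        # lower triangle only, so each pair is counted once
        total_similarity += np.tril(sims, -1).sum()
        count += len(rec_indices) * (len(rec_indices) - 1) // 2
    return total_similarity / count if count > 0 else 0

# three identical embeddings -> every pairwise similarity is 1
X = csr_matrix(np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]))
print(intra_list_similarity([[0, 1, 2]], X))  # 1.0
# two orthogonal embeddings -> ILS of 0
Y = csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0]]))
print(intra_list_similarity([[0, 1]], Y))  # 0.0
```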
df[df["Title"].str.startswith("Captain Underpants")]["Title"].values
array(['Captain Underpants and the revolting revenge of the radioactive robo-boxers',
'Captain Underpants and the Captain Underpants and the Wrath of the Wicked Wedgie Woman'],
dtype=object)
# children's books
book_titles = ["Peter Pan",
"Diary of a Wimpy Kid",
"The Jungle Book"]
recommendation_indices = []
for title in book_titles:
recs_df = recommend_knn(title, df, knn, combined_weighted_features, n_recommendations=k, verbose=False)
rec_indices = recs_df.index.to_list()
recommendation_indices.append(rec_indices)
# Calculate ILS across all recommendation lists
ils_score = intra_list_similarity(recommendation_indices, combined_weighted_features)
print("Intra-List Similarity (ILS):", ils_score)
Intra-List Similarity (ILS): 0.5094732627661639
print(df.query("Title == 'Satori in Paris'")["Synopsis"].values[0])
print("\n\n")
print(df.query("Title == 'India My Love'")["Synopsis"].values[0])
This semi-autobiographical tale of Kerouac's own trip to France, to trace his ancestors and explore his own understanding of the Buddhism that came to define his beliefs, contains some of Kerouac's most lyrical descriptions. From his reports of the strangers he meets and the all-night conversations he enjoys in seedy bars in Paris and Brittany, to the moment in a cab he experiences Buddhism's satori - a feeling of sudden awakening - Kerouac's affecting and revolutionary writing transports the reader. Published at the height of his fame, Satori in Parisis a hectic tale of philosophy, identity and the powerful strangeness of travel.

Five Past Midnight in Bhopal and The City of Joy This is the extraordinary story of Dominique Lapierre’s love affair with India, from his first 20,000 kilometre drive across the subcontinent in a veteran Silver Cloud Rolls-Royce gathering unique testimonies for his epic account of India’s independence, to his later encounters with the country’s disinherited and its saints, who taught him a wonderful lesson in sharing and hope and gave birth to the internationally renowned book and film, “The City of Joy”. It is a tale of maharajas and rickshaw pullers, of interviews with Indira Gandhi and the brother of Gandhi’s assassin, of life-changing meetings with Mother Teresa and the victims of the Bhopal disaster, of pig sticking on horseback, of life in the slums with a Swiss nurse, and of saving a home for children affected by leprosy. Above all, it is an insight into how India, with its immense mosaic of people and fascinating culture stole a Frenchman’s heart and turned his life into a testimony to the fact that “All that is not given is lost”.
Export Combined Matrix¶
npz is a NumPy file format that stores one or more arrays in a single zip archive; scipy.sparse.save_npz uses it (with compression enabled by default) to persist sparse matrices
from scipy import sparse
sparse.save_npz("out/combined_weighted_feature.npz", combined_weighted_features)
# reload the saved matrix to confirm it round-trips intact
combined_scaled_features = sparse.load_npz("out/combined_weighted_feature.npz")
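A minimal, self-contained round-trip check, using a tiny stand-in matrix and an illustrative temporary path rather than the real feature matrix:

```python
import numpy as np
from scipy import sparse

# tiny stand-in for combined_weighted_features
m = sparse.csr_matrix(np.array([[0.0, 1.5], [2.0, 0.0]]))
sparse.save_npz("/tmp/roundtrip_demo.npz", m)
m_loaded = sparse.load_npz("/tmp/roundtrip_demo.npz")
# zero nonzero entries in the difference -> every value survived the round-trip
print(abs(m - m_loaded).nnz)  # 0
```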
recommend_knn("The Bell Jar", df, knn, combined_weighted_features, n_recommendations=15)
| Title | Author | Related Genres | Subgenres | Spectral Cluster | |
|---|---|---|---|---|---|
| 307 | The Bell Jar | Sylvia Plath | [Fiction And Literature, History Biography And... | [Poetry and Prose, Classics, Psychology, Femin... | 0 |
| 316 | The Bell Jar | Sylvia Plath | [Fiction And Literature, History Biography And... | [Classics, Psychology, Feminism] | 0 |
| Title | Author | Related Genres | Subgenres | Spectral Cluster | Distance | |
|---|---|---|---|---|---|---|
| 230 | It Ends With Us | Colleen Hoover | [Fiction And Literature] | [Romance, Contemporary] | 0 | 0.450053 |
| 208 | The Seven Husbands of Evelyn Hugo | Taylor Jenkins Reid | [Fiction And Literature] | [Adult Fiction, Romance] | 0 | 0.455673 |
| 333 | The Unabridged Journals of Sylvia Plath | Sylvia Plath | [Fiction And Literature] | [Poetry and Prose, Classics] | 0 | 0.458800 |
| 1607 | What I Know For Sure | Oprah Winfrey | [Self Improvement And Relationships, History B... | [Self Help, Motivational, Biography, Autobiogr... | 0 | 0.470190 |
| 1827 | Eat Pray Love | Elizabeth Gilbert | [History Biography And Social Science, Fiction... | [Autobiography, Biography, Memoir, Womens Fict... | 0 | 0.479998 |
| 317 | Jane Eyre | Charlotte Brontë | [Fiction And Literature, History Biography And... | [Romance, Classics, History] | 0 | 0.489899 |
| 937 | Upstream | Mary Oliver | [Fiction And Literature, History Biography And... | [Poetry and Prose, Short Story, Memoir] | 0 | 0.495724 |
| 1507 | Eleven Minutes | Paulo Coelho | [Fiction And Literature, Spirituality And Phil... | [Contemporary, Drama, Romance, Philosophy] | 0 | 0.501274 |
| 1588 | The Forty Rules of Love | Elif Shafak | [Fiction And Literature, Spirituality And Phil... | [Historical Fiction, Romance, Philosophy] | 0 | 0.502542 |
| 1443 | Everything I Know about Love | Dolly Alderton | [Self Improvement And Relationships, History B... | [Self Help, Memoir] | 0 | 0.509280 |
| 253 | There's No Place Like Here | Cecelia Ahern | [Fiction And Literature] | [Mystery, Thriller and Suspense, Fantasy, Wome... | 0 | 0.519018 |
| 876 | Christmas at Tuppenny Corner | Katie Flynn | [Fiction And Literature] | [Contemporary] | 0 | 0.519239 |
| 161 | Beach Read | Emily Henry | [Fiction And Literature] | [Womens Fiction, Contemporary, Romance] | 0 | 0.519777 |
| 240 | Beautiful World, Where Are You | Sally Rooney | [Fiction And Literature] | [Contemporary, Romance] | 0 | 0.522462 |
| 311 | The Diary of a young girl | Anne Frank | [History Biography And Social Science, Fiction... | [Biography, History, Classics] | 0 | 0.523078 |